56 research outputs found

    Efficient Generator of Mathematical Expressions for Symbolic Regression

    We propose an approach to symbolic regression based on a novel variational autoencoder for generating hierarchical structures, HVAE. It combines simple atomic units with shared weights to recursively encode and decode the individual nodes in the hierarchy. Encoding is performed bottom-up and decoding top-down. We empirically show that HVAE can be trained efficiently with small corpora of mathematical expressions and can accurately encode expressions into a smooth low-dimensional latent space. The latter can be efficiently explored with various optimization methods to address the task of symbolic regression. Indeed, random search through the latent space of HVAE performs better than random search through expressions generated by manually crafted probabilistic grammars for mathematical expressions. Finally, the EDHiE system for symbolic regression, which applies an evolutionary algorithm to the latent space of HVAE, reconstructs equations from a standard symbolic regression benchmark better than a state-of-the-art system based on a similar combination of deep learning and evolutionary algorithms. Comment: 35 pages, 11 tables, 7 multi-part figures, Machine Learning (Springer) and journal track of ECML/PKDD 202

    MLFMF: Data Sets for Machine Learning for Mathematical Formalization

    We introduce MLFMF, a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean 4 library Mathlib, and some of the largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of s-expressions representing the syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the s-expressions give complete and easily parsed information about every entry. We report baseline results using standard graph and word embeddings, tree ensembles, and instance-based learning algorithms. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. The methodology used to extract the networks and the s-expressions readily applies to other libraries, and is applicable to other proof assistants. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format. Comment: NeurIPS 202
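
    The abstract above describes each entry as an s-expression encoding a syntax tree. The following is a minimal sketch of how such a string could be read into a nested list. The example entry and its tags (`:entry`, `:name`, `:type`, `:pi`) are hypothetical and are not taken from the MLFMF data sets; the actual tag vocabulary and file layout may differ.

```python
def tokenize(text):
    """Split an s-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one s-expression from a token list (tokens are consumed in place)."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens and tokens[0] != ")":
            node.append(parse(tokens))
        if not tokens:
            raise ValueError("missing closing parenthesis")
        tokens.pop(0)  # drop the matching ")"
        return node
    if token == ")":
        raise ValueError("unexpected closing parenthesis")
    return token  # an atom is kept as a plain string


def depth(tree):
    """Nesting depth of a parsed tree; an atom has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max((depth(child) for child in tree), default=0)


def atoms(tree):
    """All atoms of a parsed tree, in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [atom for child in tree for atom in atoms(child)]


# Hypothetical entry, invented for illustration only.
EXAMPLE = "(:entry (:name add-comm) (:type (:pi nat (:pi nat eq))))"
tree = parse(tokenize(EXAMPLE))
print(tree)
print("depth:", depth(tree), "atoms:", len(atoms(tree)))
```

    Nested Python lists keep the sketch short; a real pipeline would more likely map each node to a typed record so that tags and children can be told apart.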

    Parameter estimation with bio-inspired meta-heuristic optimization: modeling the dynamics of endocytosis

    BACKGROUND: We address the task of parameter estimation in models of the dynamics of biological systems based on ordinary differential equations (ODEs) from measured data, where the models are typically non-linear and have many parameters, the measurements are imperfect due to noise, and the studied system can often be only partially observed. A representative task is to estimate the parameters in a model of the dynamics of endocytosis, i.e., endosome maturation, reflected in a cut-out switch transition between the Rab5 and Rab7 domain protein concentrations, from experimental measurements of these concentrations. The general parameter estimation task and the specific instance considered here are challenging optimization problems, calling for the use of advanced meta-heuristic optimization methods, such as evolutionary or swarm-based methods. RESULTS: We apply three global-search meta-heuristic algorithms for numerical optimization, i.e., differential ant-stigmergy algorithm (DASA), particle-swarm optimization (PSO), and differential evolution (DE), as well as a local-search derivative-based algorithm 717 (A717) to the task of estimating parameters in ODEs. We evaluate their performance on the considered representative task along a number of metrics, including the quality of reconstructing the system output and the complete dynamics, as well as the speed of convergence, both on real experimental data and on artificial pseudo-experimental data with varying amounts of noise. We compare the four optimization methods under a range of observation scenarios, where data of different completeness and accuracy of interpretation are given as input. CONCLUSIONS: Overall, the global meta-heuristic methods (DASA, PSO, and DE) clearly and significantly outperform the local derivative-based method (A717). Among the three meta-heuristics, differential evolution (DE) performs best in terms of the objective function, i.e., reconstructing the output, and in terms of convergence. These results hold for both real and artificial data, for all observability scenarios considered, and for all amounts of noise added to the artificial data. In sum, the meta-heuristic methods considered are suitable for estimating the parameters in the ODE model of the dynamics of endocytosis under a range of conditions. With the model and conditions being representative of parameter estimation tasks in ODE models of biochemical systems, our results clearly highlight the promise of bio-inspired meta-heuristic methods for parameter estimation in dynamic system models within systems biology.
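
    To make the idea of estimating ODE parameters with differential evolution concrete, here is a minimal sketch. It does not use the endocytosis model: as a stand-in it fits the decay rate `k` and initial value `x0` of the toy equation dx/dt = -k*x to noise-free synthetic data. The DE variant (rand/1/bin), the control settings, the parameter bounds, and the RK4 integrator are all assumptions chosen for illustration, not the configuration used in the study.

```python
import random


def simulate(k, x0, dt=0.1, steps=50):
    """Integrate the toy model dx/dt = -k*x with classical RK4."""
    def rhs(x):
        return -k * x

    x, xs = x0, [x0]
    for _ in range(steps):
        s1 = rhs(x)
        s2 = rhs(x + 0.5 * dt * s1)
        s3 = rhs(x + 0.5 * dt * s2)
        s4 = rhs(x + dt * s3)
        x += dt * (s1 + 2 * s2 + 2 * s3 + s4) / 6
        xs.append(x)
    return xs


def sse(params, observed):
    """Sum of squared errors between simulated and observed output."""
    k, x0 = params
    return sum((a - b) ** 2 for a, b in zip(simulate(k, x0), observed))


def differential_evolution(objective, bounds, pop_size=20, weight=0.8,
                           crossover=0.9, generations=200, seed=0):
    """Minimize `objective` with DE/rand/1/bin (assumed variant)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < crossover:
                    value = pop[a][d] + weight * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the bounds
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_cost = objective(trial)
            if trial_cost <= cost[i]:  # greedy one-to-one selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


# Synthetic "measurements" generated from known parameters (k=0.7, x0=2.0).
observed = simulate(0.7, 2.0)
best, best_cost = differential_evolution(
    lambda p: sse(p, observed), bounds=[(0.0, 5.0), (0.0, 5.0)])
print("estimated k, x0:", best, "objective:", best_cost)
```

    The objective here compares the full simulated output with the data. A partially observed system, as discussed in the abstract, would compare only the measured variables while the remaining state is still integrated.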

    Reconstructing dynamical networks via feature ranking

    Empirical data on real complex systems are becoming increasingly available. Parallel to this is the need for new methods of reconstructing (inferring) the topology of networks from time-resolved observations of their node dynamics. Methods based on physical insights often rely on strong assumptions about the properties and dynamics of the scrutinized network. Here, we use insights from machine learning to design a new method of network reconstruction that makes essentially no such assumptions. Specifically, we interpret the available trajectories (data) as features and use two independent feature ranking approaches, random forest and RReliefF, to rank the importance of each node for predicting the value of each other node, which yields the reconstructed adjacency matrix. We show that our method is fairly robust to coupling strength, system size, trajectory length and noise. We also find that the reconstruction quality strongly depends on the dynamical regime.
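
    The reconstruction scheme described above (score every node as a feature for predicting every other node, then read the scores as a weighted adjacency matrix) can be sketched as follows. This sketch swaps in a much simpler scorer than the paper's: the absolute lagged Pearson correlation replaces random forest and RReliefF importances. The three-node noise-driven linear system, its coefficients, and the one-step-ahead prediction target are likewise assumptions made for illustration.

```python
import math
import random


def simulate(steps=2000, seed=0):
    """Toy system: node 0 drives node 1; node 2 is isolated."""
    rng = random.Random(seed)
    x = [0.0, 0.0, 0.0]
    trajectory = [x[:]]
    for _ in range(steps):
        x = [
            0.5 * x[0] + rng.gauss(0, 1),
            0.5 * x[1] + 0.8 * x[0] + rng.gauss(0, 1),  # the only link: 0 -> 1
            0.5 * x[2] + rng.gauss(0, 1),
        ]
        trajectory.append(x[:])
    return trajectory


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def rank_links(trajectory):
    """scores[i][j]: importance of node j for predicting node i one step ahead.

    Stand-in scorer: absolute lagged correlation, not the random forest or
    RReliefF importances used in the paper.
    """
    n = len(trajectory[0])
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        target = [row[i] for row in trajectory[1:]]
        for j in range(n):
            if j != i:
                feature = [row[j] for row in trajectory[:-1]]
                scores[i][j] = abs(pearson(feature, target))
    return scores


scores = rank_links(simulate())
for row in scores:
    print(["%.2f" % value for value in row])
```

    Thresholding or ranking the off-diagonal scores gives a candidate adjacency matrix. A correlation scorer also assigns a non-zero score to the reverse direction of a true link, which is one reason a model-based importance measure may be preferable.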

    Classification tree analysis of second neoplasms in survivors of childhood cancer

    BACKGROUND: Reports on childhood cancer survivors estimate that the cumulative probability of developing secondary neoplasms varies from 3.3% to 25% at 25 years from diagnosis, and that the risk of developing another cancer is several times greater than in the general population. METHODS: In our retrospective study, we used the classification tree multivariate method on a group of 849 first cancer survivors to identify childhood cancer patients with the greatest risk of developing secondary neoplasms. RESULTS: In the observed group of patients, 34 developed a secondary neoplasm after treatment of the primary cancer. Analysis of parameters present at the treatment of the first cancer exposed two groups of patients at special risk for secondary neoplasm. The first are female patients treated for Hodgkin's disease between the ages of 10 and 15 years, whose treatment included radiotherapy. The second group at special risk were male patients with acute lymphoblastic leukemia who were treated between the ages of 4.6 and 6.6 years. CONCLUSION: The risk groups identified in our study are similar to those found by studies that used more conventional approaches. The usefulness of our approach in the study of the occurrence of second neoplasms should be confirmed in a larger-sample study, but the user-friendly presentation of results makes it attractive for further studies.